Pandas II - Working with DataFrames


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

pd.set_option('max_columns', 50)

We'll be using the MovieLens dataset in many examples going forward. The dataset contains 100,000 ratings made by 943 users on 1,682 movies.


In [2]:
# pass in column names for each CSV
u_cols = ['user_id', 'age', 'sex', 'occupation', 'zip_code']
df_users = pd.read_csv('data/MovieLens-100k/u.user', sep='|', names=u_cols)

r_cols = ['user_id', 'movie_id', 'rating', 'unix_timestamp']
df_ratings = pd.read_csv('data/MovieLens-100k/u.data', sep='\t', names=r_cols)

m_cols = ['movie_id', 'title', 'release_date', 'video_release_date', 'imdb_url']
df_movies = pd.read_csv('data/MovieLens-100k/u.item', sep='|', names=m_cols, usecols=range(5))# only load the first five columns

Summary

  1. Inspect
    a) .dtype
    b) .describe()
    c) .head(), .tail(), [i:j]
  2. Select
    a) Column Selection
    b) Row Selection
  3. Sort
    a) .sort() for DataFrames
    b) .order() for Series
  4. Operations
    a) Descriptive Stats
    b) Apply
    b) Bins
    b) Histograms
  5. Split-Apply-Combine
  6. Other
    a) Rename columns
    b) Missing values

1. Inspect

Pandas has a variety of functions for getting basic information about your DataFrame.
The most basic of which is calling your DataFrame by name. The output tells a few things about our DataFrame.

  1. It's an instance of a DataFrame.
  2. Each row is assigned an index of 0 to N-1, where N is the number of rows in the DataFrame. (index can be set arbitrary)
  3. There are 1,682 rows (every row must have an index).
  4. Our dataset has five total columns, one of which isn't populated at all (video_release_date) and two that are missing some values (release_date and imdb_url).
df_movies

a) .dtypes

Use the .dtypes attribute to get the datatype for each column.


In [3]:
print df_movies.dtypes,'\n'

print df_users.dtypes,'\n'

print df_ratings.dtypes,'\n'


movie_id                int64
title                  object
release_date           object
video_release_date    float64
imdb_url               object
dtype: object 

user_id        int64
age            int64
sex           object
occupation    object
zip_code      object
dtype: object 

user_id           int64
movie_id          int64
rating            int64
unix_timestamp    int64
dtype: object 

b) .describe()

Use the .describe() method to see the basic statistics about the DataFrame's numeric columns. Be careful though, since this will return information on all columns of a numeric datatype.


In [4]:
df_users.describe()


Out[4]:
user_id age
count 943.000000 943.000000
mean 472.000000 34.051962
std 272.364951 12.192740
min 1.000000 7.000000
25% 236.500000 25.000000
50% 472.000000 31.000000
75% 707.500000 43.000000
max 943.000000 73.000000

Notice user_id was included since it's numeric. Since this is an ID value, the stats for it don't really matter.

We can quickly see the average age of our users is just above 34 years old, with the youngest being 7 and the oldest being 73. The median age is 31, with the youngest quartile of users being 25 or younger, and the oldest quartile being at least 43.

c) .head(), tail(), [i:j]

By default, .head() displays the first five records of the DataFrame, while .tail() displays the last five.
Alternatively, Python's regular slicing [i:j] syntax works as well.


In [5]:
print df_users.head()


   user_id  age sex  occupation zip_code
0        1   24   M  technician    85711
1        2   53   F       other    94043
2        3   23   M      writer    32067
3        4   24   M  technician    43537
4        5   33   F       other    15213

In [6]:
print df_users.tail(3)


     user_id  age sex occupation zip_code
940      941   20   M    student    97229
941      942   48   F  librarian    78209
942      943   22   M    student    77841

In [7]:
print df_users[20:22]


    user_id  age sex occupation zip_code
20       21   26   M     writer    30068
21       22   25   M     writer    40206

2. Select

a) Column Selection

You can think of a DataFrame as a group of Series (ie: rows) that share an index (ie: column headers). This makes it easy to select specific columns.

Single column selection
Selecting a single column from the DataFrame will return a Series object.


In [8]:
df_users['occupation'].head()


Out[8]:
0    technician
1         other
2        writer
3    technician
4         other
Name: occupation, dtype: object

Multiple columns selection
To select multiple columns, simply pass a list of column names to the DataFrame, the output of which will be a DataFrame.


In [9]:
list_of_cols = ['occupation', 'sex'] 
print df_users[list_of_cols].head()


   occupation sex
0  technician   M
1       other   F
2      writer   M
3  technician   M
4       other   F

b) Row Selection

Row selection can be done multiple ways, but using boolean indexing or individual index .ix() are typically easiest.

Boolean Indexing


In [10]:
# users older than 25
print df_users[df_users.age > 25].head(3), '\n'

# users aged 40 AND male
print df_users[(df_users.age == 40) & (df_users.sex == 'M')].head(3), '\n'

# users younger than 30 OR female
print df_users[(df_users.sex == 'F') | (df_users.age < 30)].head(3)


   user_id  age sex occupation zip_code
1        2   53   F      other    94043
4        5   33   F      other    15213
5        6   42   M  executive    98101 

     user_id  age sex  occupation zip_code
18        19   40   M   librarian    02138
82        83   40   M       other    44133
115      116   40   M  healthcare    97232 

   user_id  age sex  occupation zip_code
0        1   24   M  technician    85711
1        2   53   F       other    94043
2        3   23   M      writer    32067

.ix() method

When you change the indexing of a DataFrame to a specific column, you use the default pandas 0-based index.
Use .ix() method for row selection based on the new index.

Let's set the index to the user_id using the .set_index() method.
NB: By default, .set_index() returns a new DataFrame, so you'll have to specify if you'd like the changes to occur in place.


In [11]:
# Change index column (new DataFrame)
new_df_users = df_users.set_index('user_id')
print new_df_users.head(3)

# Change index column (inplace)
df_users.set_index('user_id', inplace=True)
print df_users.head(3)


         age sex  occupation zip_code
user_id                              
1         24   M  technician    85711
2         53   F       other    94043
3         23   M      writer    32067
         age sex  occupation zip_code
user_id                              
1         24   M  technician    85711
2         53   F       other    94043
3         23   M      writer    32067

In [12]:
# Select users using their respective user_id
print df_users.ix[99], '\n'
print df_users.ix[[1, 50, 300]]


age                20
sex                 M
occupation    student
zip_code        63129
Name: 99, dtype: object 

         age sex  occupation zip_code
user_id                              
1         24   M  technician    85711
50        21   M      writer    52245
300       26   F  programmer    55106

Use the .reset_index() method to reset the default index (the same rule apply for inplace).


In [13]:
df_users.reset_index(inplace=True)
print df_users.head()


   user_id  age sex  occupation zip_code
0        1   24   M  technician    85711
1        2   53   F       other    94043
2        3   23   M      writer    32067
3        4   24   M  technician    43537
4        5   33   F       other    15213

3. Sort

a) .sort() for DataFrames

Use .sort() method to sort DataFrames. Returns a new instance of a Dataframe. (See DOC)

  • column : column name to base the sorting on (list for nested sorting / tuple for multi-index sorting)
  • ascending (True) : sort ascending vs. descending (specify list for multiple sort orders)
  • inplace (False): result is a new instance of DataFrame

In [14]:
# Oldest techicians
df_users.sort('age', ascending=False, inplace=True)
print df_users[df_users.occupation == "technician"][:5]


     user_id  age sex  occupation zip_code
196      197   55   M  technician    75094
440      441   50   M  technician    55013
487      488   48   M  technician    21012
324      325   48   M  technician    02139
457      458   47   M  technician    Y1A6B

b) .order() for Series

Use .order() method to sort Series. Returns a new instance of a Dataframe.

  • ascending (True) : sort ascending vs. descending (specify list for multiple sort orders)
  • inplace (False): result is a new instance of DataFrame

In [15]:
print df_users.zip_code.order()[:3]


185    00000
766    00000
8      01002
Name: zip_code, dtype: object

4. Operations

a) Descriptive Stats

A large number of methods for computing descriptive statistics and other related operations on Series, DataFrame, and Panel. For DataFrames these methods take an axis argument:

  • axis=0 : compute over indexes
  • axis=1 : compute over columns

Most methods produce a lower-dimensional result (aka aggregate functions) :

  • .count(): number of NOT NULL values
  • .nunique(): number of unique NOT NULL values
  • .size() : number of values
  • .min(): minimum
  • .max(): maximum
  • .sum(): sum of values
  • .prod(): product of values
  • .median(): arithmetic median of values
  • .quantile(): sample quantile (value at %)
  • .mean(): mean of values
  • .std(): unbiased standard deviation
  • .var(): unbiased variance
  • .mad(): mean absolute deviation
  • .sem(): unbiased standard error of the mean
  • .skew(): unbiased skewness (3rd moment)
  • .kurt(): unbiased kurtosis (4th moment)

Some methods produce an object of the same size :

  • .rank(): compute data rank (1 through n)
  • .mode(): mode
  • .abs(): absolute value
  • .cumsum(): cumulative sum
  • .cumprod(): cumulative product
  • .cummax(): cumulative maximum
  • .cummin(): cumulative minimum

b) Apply

To apply your own or another library’s functions to pandas objects, you should be aware of the three methods below. The appropriate method to use depends on whether your function expects to operate on an entire DataFrame or Series, row- or column-wise, or elementwise.

  • Tablewise Function Application: .pipe()
  • Row or Column-wise Function Application: .apply()
  • Elementwise function application: .applymap() or .map()

.pipe()

Use .pipe() for method chaining over a DataFrame. (See DOC)
The following two are equivalent :

- f(g(h(df), arg1=1), arg2=2, arg3=3)
- df.pipe(h).pipe(g, arg1=1).pipe(f, arg2=2, arg3=3)

The pipe method is inspired by unix pipes and more recently dplyr and magrittr, which have introduced the popular (%>%) (read pipe) operator for R.

.apply()

Use .apply() to apply a function along the axes of a DataFrame, like the descriptive statistics methods. (See DOC)

- df.apply(np.mean, axis=1)
- df.apply(lambda x: x.max() - x.min())

.applymap() / .map()

Use .applymap() on DataFrame or .map() on Series to operate elementwise.
The vectorized function must take a single value and return a single value.(See DOC)

- df.applymap(lambda x: len(str(x)))
- df['colA'].map(lambda x: len(str(x)))

c) Bins

Use pandas.cut() static method to bin numeric values into groups. Useful for discretization. (DOC)

pandas.cut(x, bins) returns an array of the indices (or labels) of the half-open bins to which each value of x belongs.

  • x : array of values to be binned
  • bins : sequence defining the bin edges
  • right (True): boolean indicating whether the bins include the rightmost edge or not ([a,b] or [a,b[)
  • labels (None): array used as labels for the resulting bins

In [16]:
labels = ['0-9', '10-19', '20-29', '30-39', '40-49', '50-59', '60-69', '70-79']
bins = range(0, 81, 10) # [0, 10, 20, 30, 40, 50, 60, 70, 80]
df_users['age_group'] = pd.cut(df_users.age, bins, right=False, labels=labels)
print df_users[27:31] # preview of age bin


     user_id  age sex  occupation zip_code age_group
930      931   60   M    educator    33556     60-69
693      694   60   M  programmer    06365     60-69
751      752   60   M     retired    21201     60-69
89        90   60   M    educator    78155     60-69

d) Histograms

Use .value_counts() Series method to return the counts of unique values (ie frequency). (See DOC)


In [17]:
df_users['occupation'].value_counts().head()


Out[17]:
student          196
other            105
educator          95
administrator     79
engineer          67
dtype: int64

5. Split-Apply-Combine

Use .groupby() method to execute the split-apply-combine strategy for data analysis :

  1. Split the DataFrame into groups based on some criteria (DataFrameGroupBy or SeriesGroupBy)
  2. Apply a function to each group independently
  3. Combine the results into a data structure (DataFrame or Series)

DataFrameGroupBy/SeriesGroupBy Methods (See Doc)

  • .apply(): apply your own or another library's function or list of functions
  • .agg(): aggregate using input function or dict of {column: function}
  • .transform(): transform
  • .filter(): return a copy of a DataFrame excluding elements from groups


In the apply step, we might wish to do one of the following:

  • Aggregation: computing a summary statistic (or statistics) about each group. Some examples:
    • Compute group columns sums and means :
      • gby.agg([np.sum, np.mean])
    • Compute group sizes and counts :
      • gby.agg([np.size, np.mean])
  • Transformation: perform some group-specific computations on every data point. Some examples:
    • Standardizing data (zscore) within group :
      • gby.transform(lambda x: (x - x.mean()) / x.std())
    • Filling NAs within groups with a value derived from each group
      • gby.fillna(x.mean())
  • Filtration: discard some groups, according to a group-wise computation that evaluates True or False. Some examples:
    • Discarding data that belongs to groups with only a few members :
      • gby.filter(lambda x: x.size() > 100)
    • Discarding data based on the group sum or mean
      • gby.filter(lambda x: x['A'].sum() + x['B'].sum() > 0)
    • Discarding data for missing data
      • gby.dropna(axis=0)

City of Chicago salaries

The City of Chicago is kind enough to publish all city employee salaries to its open data portal. Let's go through some basic groupby examples using this data.


In [18]:
!head -n 3 data/city-of-chicago-salaries.csv


Name,Position Title,Department,Employee Annual Salary
"AARON,  ELVIA J",WATER RATE TAKER,WATER MGMNT,$85512.00
"AARON,  JEFFERY M",POLICE OFFICER,POLICE,$75372.00

Since the data contains a '$' sign for each salary, python will treat the field as a series of strings. We can use the converters parameter to change this when reading in the file.

converters = Dict of functions for converting values in certain columns. Keys can either be integers or column labels


In [19]:
headers = ['name', 'title', 'department', 'salary']
df_chicago = pd.read_csv('data/city-of-chicago-salaries.csv',
                      header=False,
                      names=headers,
                      converters={'salary': lambda x: float(x.replace('$', ''))})
print df_chicago.head()


                    name                     title        department  salary
0        AARON,  ELVIA J          WATER RATE TAKER       WATER MGMNT   85512
1      AARON,  JEFFERY M            POLICE OFFICER            POLICE   75372
2    AARON,  KIMBERLEI R  CHIEF CONTRACT EXPEDITER  GENERAL SERVICES   80916
3    ABAD JR,  VICENTE M         CIVIL ENGINEER IV       WATER MGMNT   99648
4  ABBATACOLA,  ROBERT J       ELECTRICAL MECHANIC          AVIATION   89440

In [20]:
print df_chicago.groupby('department').count().head(3), '\n' # NOT NULL records within each column
print df_chicago.groupby('department').size().head(3) # total records for each department


               name  title  salary
department                        
ADMIN HEARNG     42     42      42
ANIMAL CONTRL    61     61      61
AVIATION       1218   1218    1218 

department
ADMIN HEARNG       42
ANIMAL CONTRL      61
AVIATION         1218
dtype: int64

In [21]:
print df_chicago.groupby('department').agg({'salary': [np.size, np.mean]}).head()


                  salary              
                    size          mean
department                            
ADMIN HEARNG          42  70336.857143
ANIMAL CONTRL         61  57121.455738
AVIATION            1218  70638.249130
BOARD OF ELECTION    110  55728.872727
BOARD OF ETHICS        9  81650.666667

What departments have the most number of distinct title positions ?

  1. Split DataFrame into groups by departement, keep only title column => SeriesGroupBy
  2. Apply .nunique() method
  3. (Combine into Serie)
  4. Order resulting Serie (NB: .order() is for Series, .sort() is for DataFrames)

In [22]:
print df_chicago.groupby('department').title.nunique().order(ascending=False)[:3]


department
WATER MGMNT    153
TRANSPORTN     150
POLICE         130
Name: title, dtype: int64

What department pays best on average ?

  1. Split DataFrame into groups by departement => DataFrameGroupBy
  2. Apply .mean() method
  3. (Combine into DataFrame)
  4. Sort resulting DataFrame according to the salary (NB: .order() is for Series, .sort() is for DataFrames)

In [23]:
print df_chicago.groupby('department').mean().sort('salary', ascending=False).head()


                      salary
department                  
DoIT            93209.939394
BUILDINGS       90720.081322
FIRE            89579.082621
MAYOR'S OFFICE  85251.949091
BUDGET & MGMT   84767.181818

In [24]:
print df_chicago.groupby('department').agg({'salary': [np.size, np.mean]}).sort(('salary', 'mean'), ascending=False).head()


               salary              
                 size          mean
department                         
DoIT               99  93209.939394
BUILDINGS         242  90720.081322
FIRE             4731  89579.082621
MAYOR'S OFFICE     99  85251.949091
BUDGET & MGMT      44  84767.181818

Who is the highest paid employee of each department ?

  1. Split DataFrame into groups by departement, keep only salary column => SeriesGroupBy
  2. Apply .rank() method
  3. (Combine into Serie)
  4. Assign the resulting Serie to a new column of the DataFrame
  5. Sort DataFrame according to salary (NB: .order() is for Series, .sort() is for DataFrames)
  6. Display only first rankers

For the .rank() method, use attributes:

  • ascending=False : to rank high (1) to low (N)
  • method='first' : so that equally high paid people within a department don't get the same rank .

In [67]:
df_chicago['dept_rank'] = df_chicago.groupby('department')['salary'].rank(method='first', ascending=False)
df_chicago.sort('salary', ascending=False, inplace=True)
print df_chicago[df_chicago['dept_rank'] == 1].head()


                         name                     title      department  \
18039     MC CARTHY,  GARRY F  SUPERINTENDENT OF POLICE          POLICE   
8004           EMANUEL,  RAHM                     MAYOR  MAYOR'S OFFICE   
25588       SANTIAGO,  JOSE A         FIRE COMMISSIONER            FIRE   
763    ANDOLINO,  ROSEMARIE S  COMMISSIONER OF AVIATION        AVIATION   
4697     CHOUCAIR,  BECHARA N    COMMISSIONER OF HEALTH          HEALTH   

       salary  dept_rank  
18039  260004          1  
8004   216210          1  
25588  202728          1  
763    186576          1  
4697   177156          1  

In [49]:
print df_chicago[df_chicago['department'] == 'MAYOR\'S OFFICE'].tail(10)


                          name                             title  \
7498         DUBIN,  LINDSAY P  STUDENT INTERN - MAYOR'S FELLOWS   
15278           KOSS,  KELLY K  STUDENT INTERN - MAYOR'S FELLOWS   
19324         MODLIN,  EMILY L  STUDENT INTERN - MAYOR'S FELLOWS   
29745        VILLA,  FABIOLA V  STUDENT INTERN - MAYOR'S FELLOWS   
31312  WITHERSPOON,  KATELYN K  STUDENT INTERN - MAYOR'S FELLOWS   
16182           LE,  THU GIANG  STUDENT INTERN - MAYOR'S FELLOWS   
24861     ROSENBERG,  AMANDA J   STUDENT INTERN - MAYOR'S OFFICE   
4589             CHEN,  YIXIAN  STUDENT INTERN - MAYOR'S FELLOWS   
8062         EPSTEIN,  PAIGE A  STUDENT INTERN - MAYOR'S FELLOWS   
15146            KOCH,  STEVEN          ADMINISTRATIVE SECRETARY   

           department    salary  dept_rank  
7498   MAYOR'S OFFICE  36400.00          2  
15278  MAYOR'S OFFICE  36400.00          3  
19324  MAYOR'S OFFICE  36400.00          4  
29745  MAYOR'S OFFICE  36400.00          5  
31312  MAYOR'S OFFICE  36400.00          6  
16182  MAYOR'S OFFICE  36400.00          7  
24861  MAYOR'S OFFICE  36400.00          8  
4589   MAYOR'S OFFICE  36400.00          9  
8062   MAYOR'S OFFICE  36400.00         10  
15146  MAYOR'S OFFICE      0.96          1  

6. Other

a) Rename columns

Use .rename() method to change columns names.

ex: df.rename(columns={'old_col_name' : 'new_col_name'}, inplace = True)

  • columns : dictionnary containing the transformations to apply
  • inplace (False) : result is a new instance of DataFrame

b) Handling Missing Values

Drop missing values

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html
Use .dropna() method to drop rows or columns with missing data (NaN).

ex: df.dropna()

  • axis {(0), 1} : drop rows/columns
  • subset (None) : list of columns/rows to consider
  • inplace (False) : result is a new instance of DataFrame

Fill missing values

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html
Use .fillna() method to drop rows or columns with missing data (NaN).

ex: df['col_A'].fillna(df['col_A'].mean())

  • value : value to use to fill holes
  • method {'backfill', 'bfill', 'pad', 'ffill', (None)} : method of propagation to use for filling holes
  • inplace (False) : result is a new instance of DataFrame